This document presents the data analysis tasks for the course project of Introduction to R Programming on the 365 Data Science platform.
According to the project overview, the goal is to understand how characteristics of the housing market influence other important characteristics, such as home value. In other words, using the available data, stakeholders want to make inferences and predictions about the housing market in the observed environment.
To do so, a dataset with the following variables is provided:

- Crime Rate
- Average Rooms
- Public Transport Access
- Number of Schools
- Median Home Value
As the above list of variables shows, there are statistical measures,
such as rate, average, and median (namely
Crime.Rate, Average.Rooms, and
Median.Home.Value), associated with each observation in the
dataset. This suggests that each observation is related to a set of
elements, housing units in this case, from which these statistical
measures are calculated.
However, the variables that appear not to be statistical measures,
such as Public.Transport.Access and
Number.of.Schools, are constant values that apply to all
the elements in the set, that is, to all the housing units with which
each observation is associated.
The observations appear to represent some kind of area, for example neighborhoods or blocks, though this is not stated explicitly. For the rest of the analysis, this aspect of the dataset is kept abstract and unspecified, and the term housing area will be used.
In that light, we have pre-collected data on housing areas, where each housing area has the following characteristics:
Due to the lack of a detailed description of the provided dataset, there are a few uncertainties that one might be concerned about. These uncertainties might turn out to be unimportant with regard to the goal of the analysis, yet it is better to make note of them.
Let’s see if the first 10 rows of the dataset provide additional information or raise more questions…
## Crime Rate Average Rooms Public Transport Access Number of Schools
## 1 NA 5.585324 10 3
## 2 2.654339 5.395206 3 6
## 3 4.619221 6.033965 9 4
## 4 6.807575 5.418335 10 5
## 5 2.414617 6.189320 2 4
## 6 2.414658 5.964833 6 4
## 7 6.948032 5.832736 7 4
## 8 4.918587 5.364705 9 4
## 9 1.826314 5.596260 3 6
## 10 NA 6.528774 10 4
## Median Home Value
## 1 47.90077
## 2 41.53910
## 3 48.51757
## 4 42.50757
## 5 51.39125
## 6 49.64657
## 7 48.76959
## 8 38.21798
## 9 44.69063
## 10 52.59876
As already noted, the type of housing areas and housing units with
which the observations are associated is unclear. We also do not know
the currency or the scale in which the home values
(Median Home Value) are expressed. The unit in which the
Public Transport Access variable is expressed is also
unknown. The Number of Schools seems to be an obvious
measure; however, the type of schools these numbers refer to is not
known. Crime Rate is also a vague indicator: even though
the project describes it as a per capita measurement, one might be
suspicious looking at the high rates in the above sample of the
dataset.
As for the Public Transport Access variable, we could
also think of it as a central measure over the housing units in the
area. A constant value might indicate that the area encloses housing
units that share the same physical location, while a central measure
might indicate an aggregate over a housing area with related yet
physically spread-out housing units.
Until a more detailed investigation suggests otherwise, these concerns are considered resolved and handled as follows:
The project guide advises exploring and analyzing the dataset to understand the overall structure and characteristics of the data by:
# package imports related to all tasks
library(tidyverse)
library(magrittr)
library(ggcorrplot)
# seed
set.seed(1234)
The data is given as a CSV file called
housing_data.csv. The following code loads the data into
the object df.housing as a dataframe, keeping the original
column names intact.
# load dataset
df.housing <- read.csv("../data/housing_data.csv", check.names = FALSE)
The imported dataset confirms the previously introduced variables.
# original colnames
val.df.housing.original_colnames <- df.housing |>
colnames()
val.df.housing.original_colnames |>
print()
## [1] "Crime Rate" "Average Rooms"
## [3] "Public Transport Access" "Number of Schools"
## [5] "Median Home Value"
The dataset has the following number of rows, that is, observations:
# number of rows
df.housing |>
nrow()
## [1] 506
The dataset has the following structure:
# structure
df.housing |>
str()
## 'data.frame': 506 obs. of 5 variables:
## $ Crime Rate : num NA 2.65 4.62 6.81 2.41 ...
## $ Average Rooms : num 5.59 5.4 6.03 5.42 6.19 ...
## $ Public Transport Access: int 10 3 9 10 2 6 7 9 3 10 ...
## $ Number of Schools : int 3 6 4 5 4 4 4 4 6 4 ...
## $ Median Home Value : num 47.9 41.5 48.5 42.5 51.4 ...
Before proceeding any further, the variable names are transformed into a form better suited for the rest of the analysis.
# cleaning variable names
df.housing <- df.housing |>
janitor::clean_names()
df.housing |>
colnames()
## [1] "crime_rate" "average_rooms"
## [3] "public_transport_access" "number_of_schools"
## [5] "median_home_value"
Also, since R does not have a built-in mode function, the project guide advises implementing one.
# mode function
## returns mode(s) if there are no more than 3 modes in the given variable,
## otherwise returns NA
fn.mode <- function(x) {
  t <- table(x)
  mode <- names(t)[which(t == max(t))]
  ## max 3 modes in a variable
  if (length(mode) > 3) {
    mode <- NA
  }
  return(as.numeric(mode))
}
There are missing values in the variables according to the following table:
# column-wise presence of NA
## function
fn.hasNA <- function(x) {
  any(is.na(x))
}
## table
out.presenceNA <- df.housing |>
sapply(fn.hasNA)
names(out.presenceNA) <- paste(names(out.presenceNA), "has NA", sep = " ")
out.presenceNA |>
print()
## crime_rate has NA average_rooms has NA
## TRUE TRUE
## public_transport_access has NA number_of_schools has NA
## FALSE FALSE
## median_home_value has NA
## FALSE
# summary of variables
df.housing |>
summary()
## crime_rate average_rooms public_transport_access number_of_schools
## Min. : 0.005305 Min. :4.112 Min. : 1.000 Min. : 0.000
## 1st Qu.: 1.299938 1st Qu.:5.598 1st Qu.: 3.000 1st Qu.: 4.000
## Median : 3.031481 Median :6.033 Median : 5.000 Median : 5.000
## Mean : 3.137415 Mean :6.026 Mean : 5.421 Mean : 4.992
## 3rd Qu.: 4.584798 3rd Qu.:6.460 3rd Qu.: 8.000 3rd Qu.: 6.000
## Max. :12.631829 Max. :7.801 Max. :10.000 Max. :10.000
## NA's :25 NA's :15
## median_home_value
## Min. :31.55
## 1st Qu.:43.23
## Median :46.91
## Mean :47.10
## 3rd Qu.:50.85
## Max. :62.56
##
The modes of the variables are, respectively:
# mode of variables
df.housing |>
sapply(fn.mode)
## crime_rate average_rooms public_transport_access
## 0.1 NA 2.0
## number_of_schools median_home_value
## 5.0 NA
Being a numerical type, this variable contains decimal values from approximately \(0\) to no more than \(13\). The mean is slightly larger than the median; that is, the frequency distribution of the variable is slightly stretched to the right, exhibiting right skewness. There are \(25\) observations with a missing value for this variable. The mode is \(0.1\); however, since this is a continuous variable, a binned mode shall be identified with the help of visualization.
The distribution of the variable, visualized in a histogram, using a binwidth of \(0.5\), is the following:
# histogram
df.housing |>
ggplot() +
geom_histogram(
aes(crime_rate),
binwidth = 0.5,
boundary = 0.5,
closed = "left",
col = "purple",
fill = "purple",
alpha = 0.75
) +
scale_x_continuous(breaks = seq(0, 13, by = 1)) +
labs(
title = "Frequency distribution of Crime Rate",
subtitle = "bin intervals are left side closed",
x = "Crime Rate",
y = "Frequency"
)
# boxplot
df.housing |>
ggplot() +
geom_boxplot(
aes(crime_rate),
fill = "cornflowerblue"
) +
scale_x_continuous(
breaks = seq(0, 13, by = 1)
) +
labs(
title = "Crime Rate intervals of equal parts of the dataset",
x = "Crime Rate"
) +
theme(
axis.text.y = element_blank(),
axis.ticks.y = element_blank()
)
To sum up:
Being a numerical type, this variable contains decimal values from \(4.112\) to just under \(8\). The mean is almost exactly the same as the median; that is, no skewness is expected. There are \(15\) observations with a missing value for this variable. There is no exact mode; however, since this is a continuous variable, a binned mode shall be identified with the help of visualization.
The distribution of the variable, visualized in a histogram, using a binwidth of \(0.25\), is the following:
# histogram
df.housing |>
ggplot() +
geom_histogram(
aes(average_rooms),
binwidth = 0.25,
boundary = 0.25,
closed = "left",
col = "purple",
fill = "purple",
alpha = 0.75
) +
scale_x_continuous(breaks = seq(0, 8, by = 0.25)) +
labs(
title = "Frequency distribution of Average Rooms",
subtitle = "bin intervals are left side closed",
x = "Average Rooms",
y = "Frequency"
)
To sum up:
Being an integer type, this variable contains whole numbers from \(1\) to \(10\). The mean is larger than the median; that is, the frequency distribution of the variable is stretched to the right, exhibiting right skewness. There are no missing values in this variable; the mode is \(2\).
The distribution of the variable, visualized in a histogram, using a binwidth of \(1\), is the following:
# histogram
df.housing |>
ggplot() +
geom_histogram(
aes(public_transport_access),
binwidth = 1,
boundary = 1,
closed = "left",
col = "purple",
fill = "purple",
alpha = 0.75
) +
scale_x_continuous(
limits = c(1, 11),
breaks = seq(0, 11, by = 1)
) +
labs(
title = "Frequency distribution of Public Transport Access",
subtitle = "bin intervals are left side closed",
x = "Public Transport Access",
y = "Frequency"
)
To sum up:
Being an integer type, this variable contains whole numbers from \(0\) to \(10\). The mean is almost exactly the same as the median; that is, no skewness is expected. There are no missing values in this variable. The mode is \(5\).
The distribution of the variable, visualized in a histogram, using a binwidth of \(1\), is the following:
# histogram
df.housing |>
ggplot() +
geom_histogram(
aes(number_of_schools),
binwidth = 1,
boundary = 1,
closed = "left",
col = "purple",
fill = "purple",
alpha = 0.75
) +
scale_x_continuous(
limits = c(0, 11),
breaks = seq(0, 11, by = 1)
) +
labs(
title = "Frequency distribution of Number of Schools",
subtitle = "bin intervals are left side closed",
x = "Number of Schools",
y = "Frequency"
)
To sum up:
Being a numerical type, this variable contains decimal values from \(31.55\) to \(62.56\). The mean is slightly larger than the median; that is, the frequency distribution of the variable is slightly stretched to the right, exhibiting right skewness. There are no missing values in this variable, nor does an exact mode exist; however, since this is a continuous variable, a binned mode shall be identified with the help of visualization.
The distribution of the variable, visualized in a histogram, using a binwidth of \(2\), is the following:
# histogram
df.housing |>
ggplot() +
geom_histogram(
aes(median_home_value),
binwidth = 2,
boundary = 2,
closed = "left",
col = "purple",
fill = "purple",
alpha = 0.75
) +
scale_x_continuous(
breaks = seq(0, 65, by = 2)
) +
labs(
title = "Frequency distribution of Median Home Value",
subtitle = "bin intervals are left side closed",
x = "Median Home Value",
y = "Frequency"
)
To sum up:
The following details, which were not available at the start of the project (the project withheld this information until submission, so it can be viewed as delayed information about the dataset), might shed light on some of the uncertainties and help resolve the previously stated concerns.
A housing area represents a neighborhood and public transport access indicates the number of buses passing through the neighborhood within an hour.
Looking at the distribution of the variables of the dataset, we can settle on the scale on which the median home value is measured: thousands of USD.
The crime rate scale, however, remains vaguely specified, as the following partially resolved concerns show.
A crime rate is a standardized measure for comparing the number of crimes in areas that differ in population. It is usually expressed as the number of crimes scaled to a population of 1,000 or 100,000. For example, two areas with crimes occurring 100 and 500 times a year in populations of 10,000 and 50,000, respectively, result in the same crime rate of 10 per 1,000 people.
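The standardization described above can be sketched in R using the hypothetical figures from the example:

```r
# two hypothetical areas: yearly crime counts and their populations
crimes <- c(100, 500)
population <- c(10000, 50000)
# number of crimes scaled to a population of 1,000 people
rate.per.1000 <- crimes / population * 1000
rate.per.1000 |>
  print()
## [1] 10 10
```

Both areas end up with the same standardized rate despite their different raw crime counts.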
The project does not specify the number of people the crime rate is expressed for; however, looking at the histogram and boxplot of crime rate might help. The boxplot indicates that more than 75% of the dataset is above 1. That is, if the rate were truly per capita, more than 75% of the observed housing areas would exhibit at least one crime per person living in the area. Unless the data aims to cover areas with extremely high crime (which is not the case), the crime rate must be expressed for more than one person, such as per 1,000 or 100,000.
The project also does not specify the type of crime measured by the crime rate variable. It could mean 0 to 13 crimes per 1,000 or per 100,000 people, depending on the seriousness of the crimes involved in this measurement.
The project advises looking at whether there is a correlation between any of the variables.
A correlation matrix of the variables is computed and visualized as follows:
# correlation matrix
## matrix
cor(df.housing, use = "complete.obs")
## crime_rate average_rooms public_transport_access
## crime_rate 1.00000000 0.109411375 0.014246404
## average_rooms 0.10941138 1.000000000 -0.003768297
## public_transport_access 0.01424640 -0.003768297 1.000000000
## number_of_schools 0.02442190 0.005000546 0.035876982
## median_home_value 0.09161033 0.888351070 0.010709022
## number_of_schools median_home_value
## crime_rate 0.024421905 0.091610332
## average_rooms 0.005000546 0.888351070
## public_transport_access 0.035876982 0.010709022
## number_of_schools 1.000000000 0.004667281
## median_home_value 0.004667281 1.000000000
## visualization
df.housing |>
## use original column names
rename_with(~val.df.housing.original_colnames) |>
cor(use = "complete.obs") |>
ggcorrplot(
colors = c("black", "white", "purple"),
method = "circle",
type = "upper"
) +
labs(
title = "Correlation of Housing Market data variables"
)
## coefficient of correlation
val.cor.avg_rooms_median_home_value <- cor(df.housing$average_rooms, df.housing$median_home_value, use = "complete.obs")
Other than a few very weak positive correlations between the variables, only one stands out significantly, which is the correlation between median home value and average rooms with a correlation coefficient of \(0.8884\).
The following scatterplot also shows how closely these two variables are positively correlated.
# scatterplot of the correlated variable pair
df.housing |>
ggplot() +
geom_point(
aes(average_rooms, median_home_value),
shape = 21,
size = 2.5,
col = "cornflowerblue"
) +
labs(
title = "Scatterplot of Average Rooms and Median Home Value",
x = "Average Rooms",
y = "Median Home Value"
)
Last but not least, the coefficient of determination is:
# coefficient of determination
val.cor.avg_rooms_median_home_value |>
raise_to_power(2) |>
print()
## [1] 0.7915119
That is, around \(79\%\) of the variability of median home value is accounted for by average rooms.
Missing values introduce strong uncertainty into the picture. The way we handle them might depend on the number of such data points or the distribution of the affected variables.
One way to deal with them is to leave them out of subsequent analyses. In that case, however, one completely ignores data points that might otherwise be useful. If we instead replace missing values, we need to specify the replacement values and be careful about what fabricated values we introduce into subsequent analyses.
As the previous sections indicate, there are two variables with missing values: crime rate and average rooms. The affected variables have missing values according to the following scatterplots and proportion visualizations.
# displaying median home values for which crime rate is missing
df.housing |>
mutate(
crime_rate = ifelse(is.na(crime_rate), -1, crime_rate)
) |>
ggplot() +
geom_point(
aes(crime_rate, median_home_value, col = crime_rate == -1),
shape = 21,
size = 2.5
) +
scale_color_manual(values = c("cornflowerblue", "red")) +
labs(
x = "Crime Rate",
y = "Median Home Value",
col = "X is missing"
)
# displaying median home values for which average rooms is missing
df.housing |>
mutate(
average_rooms = ifelse(is.na(average_rooms), -1, average_rooms)
) |>
ggplot() +
geom_point(
aes(average_rooms, median_home_value, col = average_rooms == -1),
shape = 21,
size = 2.5
) +
scale_color_manual(values = c("cornflowerblue", "red")) +
labs(
x = "Average Rooms",
y = "Median Home Value",
col = "X is missing"
)
# missing value proportions
## crime rate
df.housing |>
group_by(missing = is.na(crime_rate)) |>
count() |>
ungroup() |>
ggplot() +
geom_bar(
aes("x", n, fill = factor(missing)),
stat = "identity",
position = "fill"
) +
scale_fill_manual(values = c("cornflowerblue", "coral")) +
labs(
x = "Crime Rate",
y = "Fraction",
fill = "Missing data"
) +
theme(
axis.ticks.x = element_blank(),
axis.text.x = element_blank()
)
## average rooms
df.housing |>
group_by(missing = is.na(average_rooms)) |>
count() |>
ungroup() |>
ggplot() +
geom_bar(
aes("x", n, fill = factor(missing)),
stat = "identity",
position = "fill"
) +
scale_fill_manual(values = c("cornflowerblue", "coral")) +
labs(
x = "Average Rooms",
y = "Fraction",
fill = "Missing data"
) +
theme(
axis.ticks.x = element_blank(),
axis.text.x = element_blank()
)
The number of missing data points in each of the two variables appears to be very small. Without delving into the numerous ways missing data could be replaced, median imputation will be used in the above cases.
# missing crime rate replacement with median
df.housing |>
mutate(
crime_rate_replaced = is.na(crime_rate)
) |>
mutate(
crime_rate = ifelse(crime_rate_replaced,
median(crime_rate, na.rm = TRUE),
crime_rate
)
) |>
ggplot() +
geom_point(
aes(crime_rate, median_home_value, col = crime_rate_replaced),
shape = 21,
size = 2.5
) +
scale_color_manual(values = c("cornflowerblue", "red")) +
labs(
x = "Crime Rate",
y = "Median Home Value",
col = "Imputation"
)
# missing average rooms replacement with median
df.housing |>
mutate(
average_rooms_replaced = is.na(average_rooms)
) |>
mutate(
average_rooms = ifelse(average_rooms_replaced,
median(average_rooms, na.rm = TRUE),
average_rooms
)
) |>
ggplot() +
geom_point(
aes(average_rooms, median_home_value, col = average_rooms_replaced),
shape = 21,
size = 2.5
) +
scale_color_manual(values = c("cornflowerblue", "red")) +
labs(
x = "Average Rooms",
y = "Median Home Value",
col = "Imputation"
)
It appears that the imputation does not visibly affect the crime rate variable; as for average rooms, only one or two odd points are introduced.
# summary before replacement
df.housing |>
select(crime_rate, average_rooms) |>
summary()
## crime_rate average_rooms
## Min. : 0.005305 Min. :4.112
## 1st Qu.: 1.299938 1st Qu.:5.598
## Median : 3.031481 Median :6.033
## Mean : 3.137415 Mean :6.026
## 3rd Qu.: 4.584798 3rd Qu.:6.460
## Max. :12.631829 Max. :7.801
## NA's :25 NA's :15
# replacement
val.df.housing.crime_rate.median <- median(df.housing$crime_rate, na.rm = TRUE)
val.df.housing.average_rooms.median <- median(df.housing$average_rooms, na.rm = TRUE)
df.housing.imputed <- df.housing |>
mutate(
crime_rate_replaced = is.na(crime_rate),
average_rooms_replaced = is.na(average_rooms)
) |>
mutate(
crime_rate = ifelse(crime_rate_replaced,
val.df.housing.crime_rate.median,
crime_rate
),
average_rooms = ifelse(average_rooms_replaced,
val.df.housing.average_rooms.median,
average_rooms
)
)
# summary after replacement
df.housing.imputed |>
select(crime_rate, average_rooms) |>
summary()
## crime_rate average_rooms
## Min. : 0.005305 Min. :4.112
## 1st Qu.: 1.375937 1st Qu.:5.605
## Median : 3.031481 Median :6.033
## Mean : 3.132181 Mean :6.026
## 3rd Qu.: 4.533860 3rd Qu.:6.451
## Max. :12.631829 Max. :7.801
Also, the correlation of average rooms and median home value remains almost the same.
# correlation of average rooms and median home value
val.cor.avg_rooms_median_home_value |>
print()
## [1] 0.8896695
cor(df.housing.imputed$average_rooms, df.housing.imputed$median_home_value, use = "complete.obs")
## [1] 0.8711099
One of the tasks of the project is to perform a hypothesis test regarding the impact of crime rates on median home values.
The objective is to determine whether there is a statistically significant difference between the median home values of neighborhoods with high crime rates and those with low crime rates.
The null and alternative hypotheses are stated as follows:
\[ \begin{split} H_0:\text{There is no difference between the median home values of high and low crime-rate neighborhoods.} \\ H_A:\text{There is a difference between the median home values of high and low crime-rate neighborhoods.} \end{split} \]
The hypotheses can be expressed using the population means (\(\mu_L\) for low crime-rate areas, \(\mu_H\) for high crime-rate areas) as follows:
\[ \begin{split} H_0: \mu_L - \mu_H = 0 \\ H_A: \mu_L - \mu_H \neq 0 \end{split} \]
According to the above hypotheses, the test is a two-sided test, where the significance level and confidence interval are respectively the following:
\[ \begin{split} \alpha = 5\% = 0.05 \\ CI = (1-\alpha) = 95\% = 0.95 \end{split} \]
Since we need two samples to estimate the corresponding populations, the given dataset shall be split in two based on the crime rate. First, the median crime rate is designated as the threshold (\(T\)) that separates the two crime rate categories from each other.
\[ \begin{split} T = median(\text{Crime Rate}) \\ Crime \ Category(x) = \begin{cases} High, & \text{if } x \ge T, \\ Low, & \text{otherwise}. \end{cases} \end{split} \]
Before proceeding further with the hypothesis test, we must address the fact that missing crime rate data have been replaced by the median value. That is, if we designate the median as the threshold, we end up placing all the imputed (previously missing) values into the high crime category.
Given the uncertainty of the missing values, we should refine the way these crime rate values are imputed. It is better to still use the median as the base replacement value, yet increase and decrease the two halves of the affected data points by an insignificant amount. Instead of inserting all of them with the same value, which would make all of them fall into one crime category, the imputation is made in a way that scatters the values around the median.
# revisiting summaries of crime rate
## original
df.housing |>
select(crime_rate) |>
summary()
## crime_rate
## Min. : 0.005305
## 1st Qu.: 1.299938
## Median : 3.031481
## Mean : 3.137415
## 3rd Qu.: 4.584798
## Max. :12.631829
## NA's :25
## after median imputation
df.housing.imputed |>
select(crime_rate) |>
summary()
## crime_rate
## Min. : 0.005305
## 1st Qu.: 1.375937
## Median : 3.031481
## Mean : 3.132181
## 3rd Qu.: 4.533860
## Max. :12.631829
# refine imputation
## previous values of imputation (list of median)
val.imputation.values <- df.housing.imputed$crime_rate[df.housing.imputed$crime_rate_replaced == TRUE]
## generate deviates
val.imputation.deviates <- c(
runif(
val.imputation.values |> length() |> divide_by(2) |> add(1),
-5e-3,
0
),
runif(
val.imputation.values |> length() |> divide_by(2) |> add(1),
0,
5e-3
)
) |>
sample(val.imputation.values |> length())
## new version of the dataset
df.housing.imputed.refined <- df.housing.imputed
## update affected data points by deviates
df.housing.imputed.refined$crime_rate[df.housing.imputed.refined$crime_rate_replaced == TRUE] <- val.imputation.values |>
add(val.imputation.deviates)
## visualization of imputations and the original median
df.housing.imputed.refined |>
filter(crime_rate_replaced == TRUE) |>
ggplot() +
geom_point(
aes(crime_rate, median_home_value, col = (crime_rate < val.df.housing.crime_rate.median)),
size = 4,
alpha = 0.75
) +
geom_vline(xintercept = val.df.housing.crime_rate.median) +
scale_color_manual(values = c("purple", "coral")) +
labs(
x = "Crime Rate",
y = "Median Home Value",
col = "CR < median"
)
## proportion of imputations on each side of the original median
df.housing.imputed.refined |>
filter(crime_rate_replaced == TRUE) |>
mutate(
imputation_side = crime_rate >= val.df.housing.crime_rate.median
) |>
count(imputation_side) |>
pull(n) |>
prop.table()
## [1] 0.48 0.52
# after refined imputation
## summary
df.housing.imputed.refined |>
select(crime_rate) |>
summary()
## crime_rate
## Min. : 0.005305
## 1st Qu.: 1.375937
## Median : 3.031581
## Mean : 3.132169
## 3rd Qu.: 4.533860
## Max. :12.631829
## visualization
df.housing.imputed.refined |>
ggplot() +
geom_point(
aes(crime_rate, median_home_value, col = crime_rate_replaced),
shape = 21,
size = 2.5
) +
scale_color_manual(values = c("cornflowerblue", "red")) +
labs(
x = "Crime Rate",
y = "Median Home Value",
col = "Refined Imputation"
)
# separate sample into two, based on two crime categories
## threshold
val.crime_category_threshold <- val.df.housing.crime_rate.median
## add new variable
df.housing.imputed.refined <- df.housing.imputed.refined |>
mutate(
crime_category = ifelse(crime_rate >= val.crime_category_threshold,
"High",
"Low"
)
) |>
mutate(
crime_category = fct(crime_category)
) |>
relocate(
crime_category,
.after = crime_rate
)
Since the observations had been labeled using the median crime rate value as a threshold, the two categories are equally proportioned.
# proportions of the two categories
df.housing.imputed.refined |>
select(crime_category) |>
table() |>
prop.table() |>
round(3)
## crime_category
## High Low
## 0.502 0.498
As a visual aid, the observations and the variables related to the hypothesis test can be plotted as follows:
# hypothesis related scatter plot
df.housing.imputed.refined |>
ggplot() +
geom_point(
aes(crime_rate, median_home_value, col = crime_category),
shape = 21,
size = 2.5
) +
scale_color_manual(values = c("cornflowerblue", "coral")) +
labs(
title = "Median Home Value over Crime Rate",
subtitle = "points are distinguished based on their Crime Category",
x = "Crime Rate",
y = "Median Home Value",
col = "Crime Category"
)
To decide what type of test to carry out, the following shall be considered:
The project guide advises using a t-test, which would use the sample statistics to estimate population parameters and would utilize a t-distribution with 504 degrees of freedom. However, a t-distribution with such a high number of degrees of freedom is very close to a normal distribution, as depicted in the following diagram.
# degrees of freedom
val.degrees_of_freedom <- df.housing.imputed.refined |>
nrow() |>
subtract(2)
val.degrees_of_freedom |>
print()
## [1] 504
# n vs. t distribution
data.frame(x = c(-3, 3)) |>
ggplot(aes(x)) +
stat_function(fun = dnorm, lwd = 3, col = "coral") +
stat_function(fun = dt, args = list(df = val.degrees_of_freedom), lwd = .75) +
geom_label(label = "~N(0, 1)", x = -1, y = .25, fill = "white", col = "coral", size = 5) +
geom_label(label = "df = 504", x = 1, y = .25, size = 5)
In that light, I decided to proceed as follows:
The z-test formulas for a one-sample and a two-sample test, respectively:
\[ \begin{split} Z = \frac{val - EV}{SE} \\ \\ Z = \frac{diff - EV}{SE_{diff}} = \frac{diff - EV}{\sqrt{SE_1^2 + SE_2^2}} \end{split} \]
The standard error, using the sample standard deviation:
\[ SE = \frac{s}{\sqrt{n}} \]
Using the notations of the sample means and standard deviations, the following formula is established:
\[
Z = \frac{|\overline{x}_L - \overline{x}_H| - 0}{\sqrt{\frac{s_L^2}{n_L}
+ \frac{s_H^2}{n_H}}}
\] Using R to calculate the z-score:
# rownum of crime categories, mean and standard deviation of their target variable
mtx.samples_n_mean_sd <- c(
df.housing.imputed.refined |>
filter(crime_category == "Low") |>
nrow(),
df.housing.imputed.refined |>
filter(crime_category == "Low") |>
pull(median_home_value) |>
mean(),
df.housing.imputed.refined |>
filter(crime_category == "Low") |>
pull(median_home_value) |>
sd(),
df.housing.imputed.refined |>
filter(crime_category == "High") |>
nrow(),
df.housing.imputed.refined |>
filter(crime_category == "High") |>
pull(median_home_value) |>
mean(),
df.housing.imputed.refined |>
filter(crime_category == "High") |>
pull(median_home_value) |>
sd()
) |>
matrix(nrow = 2, ncol = 3, byrow = TRUE)
colnames(mtx.samples_n_mean_sd) <- c("n", "Mean", "Standard deviation")
rownames(mtx.samples_n_mean_sd) <- c("Low crime category", "High crime category")
mtx.samples_n_mean_sd |>
print()
## n Mean Standard deviation
## Low crime category 252 46.74135 5.230683
## High crime category 254 47.46291 5.722151
# z-score calculation
z.score <- (
  (mtx.samples_n_mean_sd[1, "Standard deviation"]^2 / mtx.samples_n_mean_sd[1, "n"]) |>
    add((mtx.samples_n_mean_sd[2, "Standard deviation"]^2 / mtx.samples_n_mean_sd[2, "n"]))
) |>
  sqrt() |>
  raise_to_power(-1) |>
  multiply_by(
    abs(mtx.samples_n_mean_sd[1, "Mean"] |>
      subtract(mtx.samples_n_mean_sd[2, "Mean"])) |>
      subtract(0)
  )
The z-score is the following:
# z-score
z.score |>
print()
## [1] 1.480672
There are two ways to proceed in order to evaluate how this score relates to our hypotheses.
The p-value is calculated by summing the areas of the normal distribution curve under the following intervals:
data.frame(z = c(-4, 4)) |>
ggplot(aes(z)) +
# normal curve
stat_function(fun = dnorm, lwd = 1) +
stat_function(fun = dnorm, geom = "area", fill = "lightgrey", lwd = 1) +
# rejection area fractions
stat_function(
fun = dnorm,
xlim = c(-4, qnorm(.025)),
geom = "area",
fill = "red2"
) +
stat_function(
fun = dnorm,
xlim = c(qnorm(.975), 4),
geom = "area",
fill = "red2"
) +
## corresponding z-scores
geom_vline(xintercept = qnorm(.025), col = "red2") +
geom_label(x = qnorm(.025), y = .15, label = qnorm(.025) |> round(4), col = "red2") +
geom_vline(xintercept = qnorm(.975), col = "red2") +
geom_label(x = qnorm(.975), y = .15, label = qnorm(.975) |> round(4), col = "red2") +
# p-value fractions
stat_function(
fun = dnorm,
xlim = c(-4, -z.score),
geom = "area",
fill = "purple2",
alpha = 0.5
) +
stat_function(
fun = dnorm,
xlim = c(z.score, 4),
geom = "area",
fill = "purple2",
alpha = 0.5
) +
## corresponding z-scores
geom_vline(xintercept = -z.score, col = "purple2") +
geom_label(x = -z.score, y = .35, label = -z.score |> round(4), col = "purple2") +
geom_vline(xintercept = z.score, col = "purple2") +
geom_label(x = z.score, y = .35, label = z.score |> round(4), col = "purple2")
According to the calculations and the visual, the absolute z-score, whose two tails each carry half of the p-value, is smaller than the critical z-score associated with the significance level; that is, it stands outside the rejection region, and the p-value exceeds the significance level.
In that light, we conclude that there is no statistically significant evidence against the null hypothesis; that is, the observed difference between the median home values of the two crime-rate groups can plausibly be attributed to chance.
Simply put, we do not reject \(H_0\), the null hypothesis.
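As a numeric cross-check (a minimal sketch, not prescribed by the project guide), the two-sided p-value can be computed directly from the z-score obtained above:

```r
# two-sided p-value associated with the z-score computed above
z <- 1.480672
p.value <- 2 * (1 - pnorm(z))
p.value |>
  print()
# approximately 0.139, which exceeds alpha = 0.05, so H0 is not rejected
```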
The t-score can be calculated the same way as in the previous z-score equations; however, the p-value is obtained from a t-distribution with the previously determined \(n_L + n_H - 2\) degrees of freedom.
# t-score
t.score <- z.score
# degrees of freedom
val.degrees_of_freedom |>
print()
## [1] 504
The p-value is calculated by summing the areas under the t-distribution curve over the following intervals:
As mentioned before, a t-distribution with such a high degrees of freedom closely approximates a normal distribution.
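The two tail areas can be summed directly with pt(). A self-contained sketch, restating the t-score and degrees of freedom from the calculations above:

```r
# t-score and degrees of freedom restated from the earlier calculations
t.score <- 1.4801
degrees_of_freedom <- 504

# two-sided p-value: both tails of the t-distribution beyond |t|
p.value <- 2 * pt(-abs(t.score), df = degrees_of_freedom)
round(p.value, 4)
```

This should agree with the p-value reported by stats::t.test further below.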
data.frame(t = c(-4, 4)) |>
ggplot(aes(t)) +
# t curve
stat_function(fun = dt, args = list(df = val.degrees_of_freedom), lwd = 1) +
stat_function(fun = dt, args = list(df = val.degrees_of_freedom), geom = "area", fill = "lightgrey", lwd = 1) +
# rejection area fractions
stat_function(
fun = dt,
args = list(df = val.degrees_of_freedom),
xlim = c(-4, qt(.025, df = val.degrees_of_freedom)),
geom = "area",
fill = "red2"
) +
stat_function(
fun = dt,
args = list(df = val.degrees_of_freedom),
xlim = c(qt(.975, df = val.degrees_of_freedom), 4),
geom = "area",
fill = "red2"
) +
## corresponding t-scores
geom_vline(xintercept = qt(.025, df = val.degrees_of_freedom), col = "red2") +
geom_label(x = qt(.025, df = val.degrees_of_freedom), y = .15, label = qt(.025, df = val.degrees_of_freedom) |> round(4), col = "red2") +
geom_vline(xintercept = qt(.975, df = val.degrees_of_freedom), col = "red2") +
geom_label(x = qt(.975, df = val.degrees_of_freedom), y = .15, label = qt(.975, df = val.degrees_of_freedom) |> round(4), col = "red2") +
# p-value fractions
stat_function(
fun = dt,
args = list(df = val.degrees_of_freedom),
xlim = c(-4, -t.score),
geom = "area",
fill = "purple2",
alpha = 0.5
) +
stat_function(
fun = dt,
args = list(df = val.degrees_of_freedom),
xlim = c(t.score, 4),
geom = "area",
fill = "purple2",
alpha = 0.5
) +
## corresponding t-scores
geom_vline(xintercept = -t.score, col = "purple2") +
geom_label(x = -t.score, y = .35, label = -t.score |> round(4), col = "purple2") +
geom_vline(xintercept = t.score, col = "purple2") +
geom_label(x = t.score, y = .35, label = t.score |> round(4), col = "purple2")
Slight differences can be observed in the critical t-score bounding the rejection area: it lies a bit farther from zero than the corresponding z-score, because the tails of the t-distribution are slightly fatter.
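The fatter tails can be seen directly by comparing the critical values of the two distributions at the 5% two-sided significance level:

```r
# critical values bounding the upper rejection area
qnorm(.975)         # standard normal
qt(.975, df = 504)  # t-distribution: slightly farther from zero
```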
Let's verify the above calculations by running
stats::t.test.
# stats::t.test
t.test(
formula = median_home_value ~ crime_category,
data = df.housing.imputed.refined,
alternative = "two.sided",
mu = 0,
conf.level = 0.95,
var.equal = TRUE
)
##
## Two Sample t-test
##
## data: median_home_value by crime_category
## t = 1.4801, df = 504, p-value = 0.1395
## alternative hypothesis: true difference in means between group High and group Low is not equal to 0
## 95 percent confidence interval:
## -0.2362075 1.6793308
## sample estimates:
## mean in group High mean in group Low
## 47.46291 46.74135
Another task of the project is to perform a regression to predict median home value based on a suitable predictor.
It has been shown that the target variable is correlated with the average rooms variable. Since the two seem to align in a linear fashion, a linear regression would suit the needs.
# linear regression
df.housing.imputed.refined |>
ggplot(aes(average_rooms, median_home_value)) +
geom_point(
shape = 21,
size = 2.5,
color = "cornflowerblue"
) +
geom_smooth(
formula = y ~ x,
method = "lm",
se = FALSE,
col = "coral",
fullrange = TRUE
) +
scale_x_continuous(
limits = c(0, 12),
breaks = 0:12
) +
scale_y_continuous(
breaks = seq(0, 100, by = 5)
) +
labs(
title = "Linear regression scatterplot of Average Rooms vs Median Home Value",
x = "Average Rooms",
y = "Median Home Value"
)
# linear regression model
mdl.linear_regression <- lm(
formula = median_home_value ~ average_rooms,
data = df.housing.imputed.refined
)
mdl.linear_regression |>
summary()
##
## Call:
## lm(formula = median_home_value ~ average_rooms, data = df.housing.imputed.refined)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.6212 -1.6873 0.0096 1.7347 13.9495
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.7854 1.0944 3.459 0.000588 ***
## average_rooms 7.1886 0.1805 39.823 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.698 on 504 degrees of freedom
## Multiple R-squared: 0.7588, Adjusted R-squared: 0.7584
## F-statistic: 1586 on 1 and 504 DF, p-value: < 2.2e-16
According to the linear regression model, the regression line intercepts the ordinate at \(3.7854\) and has a slope of \(7.1886\), from which the following regression equation can be drawn:
\[ \begin{split} \hat{y}(x) = \beta_0 + \beta_1\times{x} \\ \hat{y}(x) = 3.7854 + 7.1886\times{x} \end{split} \]
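Plugging a hypothetical housing area with an average of 6 rooms into the regression equation illustrates how a prediction is formed (the coefficient values are taken from the model summary above):

```r
# coefficients taken from the model summary
beta0 <- 3.7854  # intercept
beta1 <- 7.1886  # slope for average_rooms

# predicted median home value for a hypothetical average of 6 rooms
y.hat <- beta0 + beta1 * 6
y.hat  # 46.917
```

Calling predict(mdl.linear_regression, newdata = data.frame(average_rooms = 6)) on the fitted model would produce the same value, up to the rounding of the coefficients.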
With missing values imputed, the coefficient of determination (\(R^2\)) indicates that around \(75\%\) of the variation in the dependent variable (median home value) can be accounted for by the independent variable (average rooms).
This linear regression model indicates that, on average, each additional room is associated with a \(\$7{,}188.60\) increase in the median home value.